As well as driving the Wikipedia website itself, many of the facts recorded in Wikipedia are also exposed through DBPedia, a special sort of database known as a Linked Data database.
So for example, on Wikipedia we have The Charlotte, and on DBPedia we also have The Charlotte, albeit in a form that's more intended for machine consumption.
With some help from code, and a query language called SPARQL, we can write queries over the facts known to Wikipedia...
...like a list of venues, with locations...
...because this is what does the work, and by writing code here we can reuse it and write less code elsewhere...
%%capture
#Install some essential packages
%pip install SPARQLWrapper pandas folium
# Import the necessary packages
from SPARQLWrapper import SPARQLWrapper, JSON
# Add some helper functions
# A function that will return the results of running a SPARQL query with
# a defined set of prefixes over a specified endpoint.
# It follows the same five-step process apart from creating the query, which
# is provided as an argument to the function.
def runQuery(endpoint, prefix, q):
    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(prefix + q)  # concatenate the strings representing the prefixes and the query
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()
# Import pandas to provide facilities for creating a DataFrame to hold results
import pandas as pd
# Function to convert query results into a DataFrame
# The results are assumed to be in JSON format and therefore the Python dictionary will have
# the results indexed by 'results' and then 'bindings'.
def dict2df(results):
    ''' A function to flatten the SPARQL query results and return the column values '''
    data = []
    for result in results["results"]["bindings"]:
        tmp = {}
        for el in result:
            tmp[el] = result[el]['value']
        data.append(tmp)
    df = pd.DataFrame(data)
    return df
# Function to run a query and return results in a DataFrame
def dfResults(endpoint, prefix, q):
    ''' Generate a data frame containing the results of running
        a SPARQL query with a declared prefix over a specified endpoint '''
    return dict2df(runQuery(endpoint, prefix, q))
# Print a limited number of results of a query
def printQuery(results, limit=None):
    ''' Print the results from the SPARQL query '''
    resdata = results["results"]["bindings"]
    if limit:
        resdata = resdata[:limit]
    for result in resdata:
        for ans in result:
            print('{0}: {1}'.format(ans, result[ans]['value']))
        print()
# Run a query and print out a limited number of results
def printRunQuery(endpoint, prefix, q, limit=None):
    ''' Run a SPARQL query and print a limited number of results '''
    results = runQuery(endpoint, prefix, q)
    printQuery(results, limit)
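To see what the `dict2df()` helper is actually doing, here's a minimal sketch using a hand-made results dictionary in the same shape as the JSON that SPARQLWrapper hands back (the venue name and coordinates here are invented for illustration):

```python
import pandas as pd

# A hand-made stand-in (values invented) for the JSON structure that
# SPARQLWrapper returns: results are indexed by 'results' and then
# 'bindings', and each binding maps a query variable to a dict whose
# 'value' key holds the actual result.
fake_results = {
    "results": {
        "bindings": [
            {"venue_name": {"type": "literal", "value": "Example Hall"},
             "lat": {"type": "typed-literal", "value": "51.5"},
             "lon": {"type": "typed-literal", "value": "-0.1"}}
        ]
    }
}

# Flatten each binding into a plain {variable: value} dict...
data = [{var: binding[var]['value'] for var in binding}
        for binding in fake_results["results"]["bindings"]]

# ...and let pandas turn the list of dicts into a DataFrame
demo_df = pd.DataFrame(data)
print(demo_df)
```

Calling `dict2df(fake_results)` would produce the same one-row DataFrame.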
The query language is full of long URIs and can get hard to read, so we declare prefixes that act as aliases and cut down the clutter in our queries...
# Define any prefixes
prefix = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX ouseful: <http://ouseful.info/>
'''
The endpoint is where machines go to ask questions...
# Declare the DBPedia endpoint
endpoint = "http://dbpedia.org/sparql"
Now let's phrase a question...
DBPedia, can you give me the names and geo-coordinates of venues in England?
q = '''
SELECT DISTINCT ?venue_name ?lat ?lon
WHERE {
?venue rdfs:label ?venue_name .
?venue geo:lat ?lat .
?venue geo:long ?lon .
?venue dct:subject ?is_en_venue .
?is_en_venue skos:broader dbc:Music_venues_in_England .
FILTER (langMatches(lang(?venue_name), "en"))
} LIMIT 1000
'''
Now we actually pose the question and get a response back...
df = dfResults(endpoint, prefix, q)
df
| | venue_name | lat | lon |
|---|---|---|---|
| 0 | The Coronet | 51.4948 | -0.0989 |
| 1 | Pigalle Club | 51.5095 | -0.135 |
| 2 | The Spitz | 51.5197 | -0.0747222 |
| 3 | First Direct Arena | 53.8031 | -1.54222 |
| 4 | First Direct Arena | 53.8031 | -1.54222 |
| ... | ... | ... | ... |
| 259 | The Ram Folk Club | 51.3823 | -0.3414 |
| 260 | Worthing Leisure Centre | 50.8167 | -0.408758 |
| 261 | Workington Opera House | 54.6438 | -3.5443 |
| 262 | Bradford Odeon | 53.7925 | -1.7565 |
| 263 | Guildford Civic Hall | 51.2386 | -0.5663 |

264 rows × 3 columns
Whenever you work with data, you need to tidy it up. Here, we make sure the co-ordinates are treated as numbers and get rid of any rows that contain missing values.
df['lat'] = df['lat'].astype(float)
df['lon'] = df['lon'].astype(float)
# Drop any rows (axis=0) that contain missing values
df = df.dropna(how='any', axis=0)
#Preview the data
df
| | venue_name | lat | lon |
|---|---|---|---|
| 0 | The Coronet | 51.4948 | -0.098900 |
| 1 | Pigalle Club | 51.5095 | -0.135000 |
| 2 | The Spitz | 51.5197 | -0.074722 |
| 3 | First Direct Arena | 53.8031 | -1.542220 |
| 4 | First Direct Arena | 53.8031 | -1.542220 |
| ... | ... | ... | ... |
| 259 | The Ram Folk Club | 51.3823 | -0.341400 |
| 260 | Worthing Leisure Centre | 50.8167 | -0.408758 |
| 261 | Workington Opera House | 54.6438 | -3.544300 |
| 262 | Bradford Odeon | 53.7925 | -1.756500 |
| 263 | Guildford Civic Hall | 51.2386 | -0.566300 |

264 rows × 3 columns
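As a sketch of the same tidy-up on a toy DataFrame (the values are invented): `pd.to_numeric` with `errors='coerce'` is a slightly more forgiving alternative to `astype(float)`, turning anything unparseable into a missing value, which `dropna` then removes row by row:

```python
import pandas as pd

# Toy data (made up) with one unusable coordinate
toy = pd.DataFrame({
    'venue_name': ['A Hall', 'B Club', 'C Arena'],
    'lat': ['51.5', '52.1', None],
    'lon': ['-0.1', '1.3', '-2.2'],
})

# Coerce the coordinate columns to numbers; anything
# unparseable (or missing) becomes NaN...
toy['lat'] = pd.to_numeric(toy['lat'], errors='coerce')
toy['lon'] = pd.to_numeric(toy['lon'], errors='coerce')

# ...then drop any rows containing missing values
toy = toy.dropna(how='any')
print(toy)
```

The result keeps only the two fully-populated rows.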
Maps are actually quite easy to work with in code... Often it takes just a couple of lines to pull everything together, and fewer still if we use some magic (but more on that later...).
#folium is a package for doing stuff with maps
import folium
For each row of our dataset, plot a corresponding marker on a map...
m = folium.Map(location=[55, 0], zoom_start=5)
for row in df.itertuples():
    folium.Marker([row.lat, row.lon], popup=row.venue_name).add_to(m)
m
We can also save the map to an HTML file that we can share around, pop onto websites, etc..
m.save('venues.html')